Geospatial Visualisation of US Census Data

This is the first in a series of notebooks documenting what I learned while completing a project in Python. The project is based on US Census data provided by MuonNeutrino on Kaggle (kaggle.com/muonneutrino/us-census-demographic-data). In this first notebook I visualise the county data on an interactive map. The second notebook in the project uses machine learning techniques to analyse the data.

To visualise the data I needed some modules, plus geographic data so that values could be plotted on a map by county ID. For this I used data I found on the plotly website (plotly.com/python/mapbox-county-choropleth/), which is loaded in the initialisation cell.

And, well, that's not right. Not all the states are there.

Looking at the data, I noticed that the missing states are the ones at the start of the list, whose Census IDs start with 0. The map needs that leading 0 in order to match each county, but it was not included in the dataset.

At first I tried editing the CSV file itself in Excel, but this didn't help, as pandas still stored the values without the zeros. It also had a tendency to mess with the file, changing some of the characters.

Instead, the values were edited in the Python code using pandas directly.
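As a minimal sketch of that kind of fix, the snippet below pads the IDs back out to five characters. The small DataFrame is a hypothetical stand-in for the census CSV, and the column name `CensusId` and the five-digit width are assumptions about the dataset rather than something confirmed here.

```python
import pandas as pd

# Hypothetical sample standing in for the census CSV.
# The column name "CensusId" is an assumption about the dataset's schema.
df = pd.DataFrame(
    {
        "CensusId": [1001, 6037, 48201],
        "County": ["Autauga", "Los Angeles", "Harris"],
    }
)

# Convert the integer IDs to strings, then left-pad with zeros to five characters
df["CensusId"] = df["CensusId"].astype(str).str.zfill(5)

print(df["CensusId"].tolist())  # ['01001', '06037', '48201']
```

Reading the column as text in the first place (for example with `dtype={"CensusId": str}` in `pd.read_csv`) stops pandas from parsing it as an integer, but if the zeros are already missing from the file the padding step is still needed.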

That's looking better. However, I couldn't help but notice that there were still some grey spots on the map. To make these easier to investigate, I also edited the hover menu to give more information on each county highlighted.

On investigation, the grey spots in Utah and Florida could be explained by lakes, but the ones in South Dakota, Texas, Virginia and some other states couldn't be explained so easily. From what I can tell they seem to be counties that don't manage themselves and are unorganised, so they were likely excluded from the census data I used.
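One way to list the grey counties rather than hunting for them by eye is to compare the IDs in the map data against the IDs in the census table. This is only a sketch: the tiny `counties` dictionary is a hypothetical, heavily trimmed stand-in for the real GeoJSON, the `NAME` property key is an assumption about its layout, and the sample IDs are placeholders rather than findings.

```python
# Hypothetical, heavily trimmed stand-in for the county GeoJSON.
# The "NAME" property key is an assumption about the file's layout.
counties = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "id": "01001", "properties": {"NAME": "Autauga"}},
        {"type": "Feature", "id": "46113", "properties": {"NAME": "Shannon"}},
        {"type": "Feature", "id": "49035", "properties": {"NAME": "Salt Lake"}},
    ],
}

# Placeholder set of zero-padded IDs present in the census table
census_ids = {"01001", "49035"}

# Map each GeoJSON county ID to its name
geo_names = {feature["id"]: feature["properties"]["NAME"] for feature in counties["features"]}

# Counties drawn on the map that have no matching census row (these render grey)
missing = sorted(set(geo_names) - census_ids)

for fips in missing:
    print(fips, geo_names[fips])
```

With the real data, `census_ids` would be built from the zero-padded ID column of the DataFrame.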

On further examination, there was another problem: a lot of the data was given as raw counts rather than percentages. This means that maps of certain values tended to be little more than population maps, rather than maps showing meaningful statistics.

This is not really the kind of statistic that is useful for either visualisation or data analysis. When we want to know about a quantity such as the number of men in a county, we are likely far more interested in its share of the population. To rectify this, we can divide the columns listing raw counts by the total population.
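A rough sketch of that conversion is shown here on made-up numbers. The column names `TotalPop`, `Men` and `Women` are assumptions about the dataset, and the `Pct` suffix is just a naming choice for this example.

```python
import pandas as pd

# Made-up rows; the column names are assumptions about the census dataset
df = pd.DataFrame(
    {
        "County": ["Small County", "Large County"],
        "TotalPop": [1000, 200000],
        "Men": [520, 98000],
        "Women": [480, 102000],
    }
)

# Columns holding raw counts that should become percentages of the population
count_cols = ["Men", "Women"]

for col in count_cols:
    # Divide each raw count by the county's total population
    df[col + "Pct"] = (df[col] / df["TotalPop"] * 100).round(1)

print(df[["County", "Men", "MenPct"]])
```

In this toy example the large county has far more men in absolute terms, but the small county has the higher proportion, which is exactly the difference a raw-count map hides.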

Now we are given a far more meaningful map, showing where there are higher proportions of men, rather than just the population of each county.

There are other styles of map too. The following shows the whole world, which is useful for this dataset in order to see Puerto Rico.

With the visualisation of the dataset done, I next wanted to focus on analysing the census data using machine learning methods.